Add Q4_K/Q5_K/Q6_K GPU support via Q8_0 dequantization by AdamBien · Pull Request #108 · beehive-lab/GPULlama3.java

AdamBien · 2026-04-19T10:09:07Z

Add GPU support for K-quant models (Q4_K_M, Q5_K_M, Q6_K) via load-time dequantization to Q8_0
New FloatTensor implementations: Q4_KFloatTensor, Q5_KFloatTensor, Q6_KFloatTensor
Dequantization correctly handles TornadoVM's 16-byte ARRAY_HEADER memory layout
Centralize weight loading log message in AbstractModelLoader (shows actual model quantization, e.g. "Q4_K_M -> Q8_0")
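
To illustrate the load-time conversion described above, here is a minimal sketch that expands one Q4_K super-block to floats and re-quantizes it into 32-value 8-bit blocks. It is not the code from this PR. The class and method names are hypothetical, the Q4_K layout (fp16 d, fp16 dmin, 12 packed scale bytes, 128 quant bytes) follows the GGML block definition, the scale is kept as a float32 for brevity (real Q8_0 stores an fp16 scale), and placing the payload after a 16-byte header is an assumption based only on the PR description.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch only: names are hypothetical and this is not the PR's code.
final class KQuantToQ8Sketch {
    static final int QK_K = 256;        // values per Q4_K super-block
    static final int Q8_BLOCK = 32;     // values per 8-bit block
    static final int Q8_BYTES = 4 + 32; // float32 scale + 32 int8 (real Q8_0 uses an fp16 scale)
    static final int HEADER_BYTES = 16; // assumed header size, taken from the PR description

    /** Decodes an IEEE 754 half-precision value held in a short. */
    static float halfToFloat(short h) {
        int bits = h & 0xFFFF, exp = (bits >>> 10) & 0x1F, mant = bits & 0x3FF;
        float v = exp == 0 ? Math.scalb((float) mant, -24)
                : exp == 31 ? (mant == 0 ? Float.POSITIVE_INFINITY : Float.NaN)
                : Math.scalb((float) (mant | 0x400), exp - 25);
        return (bits & 0x8000) != 0 ? -v : v;
    }

    /** Unpacks the 6-bit {scale, min} pair of sub-block j from the 12 packed bytes. */
    static int[] scaleMin(int j, byte[] s) {
        if (j < 4) return new int[] { s[j] & 63, s[j + 4] & 63 };
        int sc = (s[j + 4] & 0x0F) | (((s[j - 4] & 0xFF) >>> 6) << 4);
        int mn = ((s[j + 4] & 0xFF) >>> 4) | (((s[j] & 0xFF) >>> 6) << 4);
        return new int[] { sc, mn };
    }

    /** Expands one 144-byte Q4_K super-block (d, dmin, scales[12], qs[128]) to 256 floats. */
    static float[] dequantizeQ4K(ByteBuffer block) {
        ByteBuffer b = block.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        float d = halfToFloat(b.getShort(0));
        float dmin = halfToFloat(b.getShort(2));
        byte[] scales = new byte[12];
        for (int i = 0; i < 12; i++) scales[i] = b.get(4 + i);
        float[] out = new float[QK_K];
        for (int pair = 0; pair < 4; pair++) { // two sub-blocks share 32 quant bytes
            int[] lo = scaleMin(2 * pair, scales);
            int[] hi = scaleMin(2 * pair + 1, scales);
            for (int l = 0; l < 32; l++) {
                int q = b.get(16 + 32 * pair + l) & 0xFF;
                out[64 * pair + l] = d * lo[0] * (q & 0x0F) - dmin * lo[1];      // low nibble
                out[64 * pair + 32 + l] = d * hi[0] * (q >>> 4) - dmin * hi[1];  // high nibble
            }
        }
        return out;
    }

    /** Re-quantizes floats into 32-value blocks, written after the assumed header. */
    static ByteBuffer quantizeQ8(float[] x) {
        if (x.length % Q8_BLOCK != 0) throw new IllegalArgumentException("length must be a multiple of 32");
        int blocks = x.length / Q8_BLOCK;
        ByteBuffer out = ByteBuffer.allocate(HEADER_BYTES + blocks * Q8_BYTES).order(ByteOrder.LITTLE_ENDIAN);
        for (int b = 0; b < blocks; b++) {
            float maxAbs = 0f;
            for (int i = 0; i < Q8_BLOCK; i++) maxAbs = Math.max(maxAbs, Math.abs(x[b * Q8_BLOCK + i]));
            float scale = maxAbs / 127f; // symmetric scale: largest magnitude maps to +/-127
            int off = HEADER_BYTES + b * Q8_BYTES;
            out.putFloat(off, scale);
            for (int i = 0; i < Q8_BLOCK; i++) {
                float v = scale == 0f ? 0f : x[b * Q8_BLOCK + i] / scale;
                out.put(off + 4 + i, (byte) Math.round(v));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer blk = ByteBuffer.allocate(144).order(ByteOrder.LITTLE_ENDIAN);
        blk.putShort(0, (short) 0x3C00); // d = 1.0 in fp16
        blk.putShort(2, (short) 0x3800); // dmin = 0.5 in fp16
        for (int i = 0; i < 8; i++) blk.put(4 + i, (byte) (i < 4 ? 2 : 1)); // scales and mins of sub-blocks 0..3
        for (int i = 0; i < 128; i++) blk.put(16 + i, (byte) 0x21);         // low nibble 1, high nibble 2
        float[] w = dequantizeQ4K(blk);
        ByteBuffer q8 = quantizeQ8(w);
        System.out.println("first weight: " + w[0] + ", output bytes: " + q8.capacity());
    }
}
```

Going through an intermediate float array keeps the sketch easy to follow, but it costs extra memory and a second pass over every weight. Whether the PR converts this way or in a single pass is not stated here.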

Tested with:
./llamaTornado --gpu --verbose-init --metal --model /Users/abien/work/workspaces/llms/Devstral-Small-2-24B-Instruct-2512-Q4_K_M.gguf --prompt "who are you?" --gpu-memory 30GB

CLAassistant · 2026-04-19T10:09:14Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

orionpapadakis

LGTM

orionpapadakis · 2026-05-29T11:00:47Z

For future reference: note the gap in GGUF Model Load time between "direct" execution (e.g. Llama Q8_0, about 0.7 s) and load-time "dequantization" (e.g. Llama Q4_K_M, about 23.6 s) in the runs below. This will be addressed by #118.

./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "tell me a joke" --max-tokens 2048 --verbose-init

WARNING: Using incubator modules: jdk.incubator.vector
Loading model weights in TornadoVM format (Q8_0 -> Q8_0)

Starting TornadoVM initialization...
Here's one:

What do you call a fake noodle?

(wait for it...)

An impasta!

Hope that made you laugh! Do you want to hear another one?

==== Performance Metrics ====
achieved tok/s: 67.62. Tokens: 50, seconds: 0.74
GGUF Model Load: 681.63 ms
Compilation & CodeGen: 557.64 ms
Warmup: 2659.84 ms
Read-only weights Copy-in: 453.29 ms

./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q4_K_M.gguf --prompt "tell me a joke" --max-tokens 2048 --verbose-init

WARNING: Using incubator modules: jdk.incubator.vector
Loading model weights in TornadoVM format (Q4_K_M -> Q8_0)

Starting TornadoVM initialization...
Here's one:

What do you call a fake noodle?

(wait for it...)

An impasta!

Hope that made you laugh! Do you want to hear another one?

==== Performance Metrics ====
achieved tok/s: 65.47. Tokens: 50, seconds: 0.76
GGUF Model Load: 23599.27 ms
Compilation & CodeGen: 548.25 ms
Warmup: 998.89 ms
Read-only weights Copy-in: 256.44 ms

./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q5_K_M.gguf --prompt "tell me a joke" --max-tokens 2048 --verbose-init

WARNING: Using incubator modules: jdk.incubator.vector
Loading model weights in TornadoVM format (Q5_K_M -> Q8_0)

Starting TornadoVM initialization...
Here's one:

What do you call a fake noodle?

(wait for it...)

An impasta!

Hope that made you laugh!

==== Performance Metrics ====
achieved tok/s: 67.68. Tokens: 42, seconds: 0.62
GGUF Model Load: 24139.17 ms
Compilation & CodeGen: 532.88 ms
Warmup: 943.10 ms
Read-only weights Copy-in: 244.08 ms

./llama-tornado --gpu --ptx --model ~/LLMModels/granite-4.0-1b-Q4_K_M.gguf --prompt "tell me a joke" --max-tokens 2048 --verbose-init

WARNING: Using incubator modules: jdk.incubator.vector
Loading model weights in TornadoVM format (Q4_K_M -> Q8_0)

Starting TornadoVM initialization...
Sure, here's a joke for you:

Why don't scientists trust atoms?

Because they make up everything!

This joke plays on the double meaning of "make up." In science, atoms are the basic building blocks of matter, and they "make up" all the substances we observe. However, the phrase "make up" is also used to mean fabricate or lie about something. So, the joke suggests that atoms are not trustworthy because they "make up" everything, implying they fabricate or lie about their existence.

==== Performance Metrics ====
achieved tok/s: 15.58. Tokens: 119, seconds: 7.64
GGUF Model Load: 28379.93 ms
Compilation & CodeGen: 527.33 ms
Warmup: 5179.27 ms
Read-only weights Copy-in: 372.44 ms

./llama-tornado --gpu --ptx --model ~/LLMModels/Qwen3-1.7B-Q4_K_M.gguf --prompt "tell me a joke /no_think" --max-tokens 2048 --verbose-init

WARNING: Using incubator modules: jdk.incubator.vector
Loading model weights in TornadoVM format (Q4_K_M -> Q8_0)

Starting TornadoVM initialization...
<think>

</think>

Sure! Here's a light-hearted joke for you:

Why don't scientists trust atoms? Because they never trust **their** colleagues. 😄

Let me know if you want another one!

==== Performance Metrics ====
achieved tok/s: 28.01. Tokens: 59, seconds: 2.11
GGUF Model Load: 30445.23 ms
Compilation & CodeGen: 559.15 ms
Warmup: 3979.90 ms
Read-only weights Copy-in: 355.77 ms

./llama-tornado --gpu --ptx --model ~/LLMModels/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf --prompt "tell me a joke" --max-tokens 2048 --verbose-init

WARNING: Using incubator modules: jdk.incubator.vector
Loading model weights in TornadoVM format (Q4_K_M -> Q8_0)

Starting TornadoVM initialization...
 Why did the tomato turn red?

Because it saw the salad dressing!

(This is a play on words, as tomatoes are red before they are added to a salad, and the phrase "saw the salad dressing" is meant to be humorous because tomatoes cannot see.)

==== Performance Metrics ====
achieved tok/s: 15.72. Tokens: 77, seconds: 4.90
GGUF Model Load: 107581.08 ms
Compilation & CodeGen: 628.44 ms
Warmup: 4573.61 ms
Read-only weights Copy-in: 773.35 ms
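
The load-time gap in the Llama-3.2-1B runs above can be quantified with simple arithmetic. The sketch below uses only the "GGUF Model Load" values reported in those logs; the class and method names are hypothetical. With these figures the K-quant loads come out roughly 35 times slower than the Q8_0 load, while the reported tok/s stays within a few percent.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: compares load times copied from the logs above.
final class LoadTimeComparison {
    /** Returns how many times longer a load took than the baseline. */
    static double ratio(double loadMs, double baselineMs) {
        return loadMs / baselineMs;
    }

    public static void main(String[] args) {
        double q8BaselineMs = 681.63; // Llama-3.2-1B-Instruct Q8_0, direct execution
        Map<String, Double> dequantLoadsMs = new LinkedHashMap<>();
        dequantLoadsMs.put("Llama-3.2-1B-Instruct Q4_K_M", 23599.27);
        dequantLoadsMs.put("Llama-3.2-1B-Instruct Q5_K_M", 24139.17);
        for (Map.Entry<String, Double> e : dequantLoadsMs.entrySet()) {
            System.out.printf("%s: %.1fx the Q8_0 load time%n",
                    e.getKey(), ratio(e.getValue(), q8BaselineMs));
        }
    }
}
```

Only the Llama-3.2-1B runs are compared because they share a model and differ only in quantization; the other models have no Q8_0 baseline in this thread.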

AdamBien added 2 commits April 19, 2026 12:01

additional information / output added

789d36a

support for Q4 quantization added

58f7e2a

mikepapadim requested review from kotselidis, mairooni, mikepapadim, orionpapadakis and stratika and removed request for orionpapadakis April 19, 2026 10:36

orionpapadakis approved these changes May 12, 2026

orionpapadakis mentioned this pull request May 29, 2026

Support direct execution of K-quant GGUF weights on GPU (Q4_K, Q5_K, Q6_K) #118

Open

orionpapadakis closed this May 29, 2026

orionpapadakis reopened this May 29, 2026

orionpapadakis merged commit b94b20f into beehive-lab:main May 29, 2026
8 of 9 checks passed

orionpapadakis merged 2 commits into beehive-lab:main from AdamBien:main

3 participants
